Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix out-of-bounds access in module_diag_hailcast.F90 which crashes RRFS on WCOSS2 #2064

Merged
merged 24 commits into from
Jan 17, 2024

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Dec 27, 2023

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full RT suite (compared to current baselines) on either Hera/Derecho/Hercules AND have committed the log to my PR branch.
  • Add list of all failed regression tests in "Regression Tests" section.

PR Information

Description

Fixes a bug that can crash the RRFS ensembles. When KBAS=1, there's an out-of-bounds write in an array. That corrupts memory and occasionally crashes the model.

Also, I remove from default_vars.sh the unused DO_MYJPBL variable whose value contains a typo.

Commit Message

Eliminate an out-of-bounds write in FV3 dynamical core hail diagnostic code.

FV3 component GFDL_atmos_cubed_sphere submodule:

The module_diag_hailcast.F90 has a KBAS index which can have the value of 1. When it has that value, a later line writes to address 0 of a 1-based array. This caused the RRFS ensemble to crash on WCOSS2.

Priority

  • Critical Bugfix (This PR contains a critical bug fix and should be prioritized.)
  • High (This PR contains a feature or fix needed for a time-sensitive project (eg, retrospectives, implementations))
  • Normal

Blocking Dependencies

Git Issues Fixed By This PR

Changes

Subcomponent (with links)

Input data

  • No changes are expected to input data.

Regression Tests:

  • No changes are expected to any regression test.

Libraries

  • Not Needed

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA SamuelTrahanNOAA changed the title Fix out-of-bounds write in module_diag_hailcast.F90 which crashes RRFS on WCOSS2 Fix out-of-bounds access in module_diag_hailcast.F90 which crashes RRFS on WCOSS2 Dec 27, 2023
@SamuelTrahanNOAA SamuelTrahanNOAA marked this pull request as draft December 27, 2023 20:05
@SamuelTrahanNOAA
Copy link
Collaborator Author

This is a draft because I discovered I used the wrong fix. I need to do this instead:

-    DO k=KBAS,nz
+    DO k=KBAS+1,nz
        RWA_new(k) = RWA_new(k) / (h1d(k)-h1d(k-1))
     ENDDO

I'm testing this now on WCOSS2 and Hera.

@SamuelTrahanNOAA
Copy link
Collaborator Author

The correct fix works too. This is ready for review.

@SamuelTrahanNOAA SamuelTrahanNOAA marked this pull request as ready for review December 28, 2023 02:23
@jkbk2004
Copy link
Collaborator

jkbk2004 commented Jan 2, 2024

@SamuelTrahanNOAA do you think we can combine this pr to @dustinswales 's #2035 ? All no baseline changes. For this pr, just need to point to the cubed-sphere branch.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@SamuelTrahanNOAA do you think we can combine this pr to @dustinswales 's #2035 ? All no baseline changes. For this pr, just need to point to the cubed-sphere branch.

This PR needs more testing and review before it can be merged. The original developer of the code needs to confirm my fix is correct. Also, the parallels are still testing the change.

@SamuelTrahanNOAA SamuelTrahanNOAA marked this pull request as draft January 2, 2024 18:37
@SamuelTrahanNOAA
Copy link
Collaborator Author

I've changed this to a draft to reflect the fact that it needs more review.

@SamuelTrahanNOAA
Copy link
Collaborator Author

There are more errors in the module_diag_hailcast.F90 code than I first thought. I'm waiting to hear back from the original developer about how to fix them.

@SamuelTrahanNOAA SamuelTrahanNOAA marked this pull request as ready for review January 12, 2024 16:01
@SamuelTrahanNOAA
Copy link
Collaborator Author

@jkbk2004 - This has been ready for review for several days. I forgot to mark it so.

@jkbk2004
Copy link
Collaborator

@SamuelTrahanNOAA can you sync up branch? wcoss2 file system maintenance this week. So we may go ahead with no baseline change PR first.

@SamuelTrahanNOAA
Copy link
Collaborator Author

can you sync up branch? wcoss2 file system maintenance this week. So we may go ahead with no baseline change PR first.

@jkbk2004 - Yes, it is up to date now.

@jkbk2004 jkbk2004 added No Baseline Change No Baseline Change Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. labels Jan 16, 2024
@jkbk2004
Copy link
Collaborator

@zach1221 @FernandoAndrade-NOAA @BrianCurtis-NOAA FYI: this PR is ready.

@BrianCurtis-NOAA
Copy link
Collaborator

If we're skipping WCOSS2 we can't properly test this PR, right?

@jkbk2004
Copy link
Collaborator

If we're skipping WCOSS2 we can't properly test this PR, right?

@BrianCurtis-NOAA @SamuelTrahanNOAA I think we should move on to merge this PR even only with RDHPCS tests. Government shutdown risk is not still cleared out. In that case, GFDL feds will not be available.

@SamuelTrahanNOAA
Copy link
Collaborator Author

If we're skipping WCOSS2 we can't properly test this PR, right?

@BrianCurtis-NOAA @SamuelTrahanNOAA I think we should move on to merge this PR even only with RDHPCS tests. Government shutdown risk is not still cleared out. In that case, GFDL feds will not be available.

There has been some minimal testing on WCOSS with RRFS. That isn't as thorough as a regression test.

Is Acorn available?

@BrianCurtis-NOAA
Copy link
Collaborator

If we're skipping WCOSS2 we can't properly test this PR, right?

@BrianCurtis-NOAA @SamuelTrahanNOAA I think we should move on to merge this PR even only with RDHPCS tests. Government shutdown risk is not still cleared out. In that case, GFDL feds will not be available.

There has been some minimal testing on WCOSS with RRFS. That isn't as thorough as a regression test.

Is Acorn available?

@SamuelTrahanNOAA was this issue also on Acorn. I would be OK if the WCOSS2 issue is present here, and using the Acorn RT to confirm it's fixed.

@SamuelTrahanNOAA
Copy link
Collaborator Author

If we're skipping WCOSS2 we can't properly test this PR, right?

I think we should move on to merge this PR even only with RDHPCS tests. Government shutdown risk is not still cleared out. In that case, GFDL feds will not be available.

There has been some minimal testing on WCOSS with RRFS. That isn't as thorough as a regression test.
Is Acorn available?

was this issue also on Acorn. I would be OK if the WCOSS2 issue is present here, and using the Acorn RT to confirm it's fixed.

I'm more concerned about whether the results change for other tests.

We know this doesn't stop all the crashes. That's why Anning has another fix. There's a bug fix in the DA code. Most likely, there'll be a dozen more bug fixes before this is operational.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I've retested it with one of the failing cases. The test was on Hera, but it fails in debug mode reliably on Hera without the fix.

The latest fix still works.

@BrianCurtis-NOAA
Copy link
Collaborator

For now, Acorn is a full pass. Not happy with these long term WCOSS2 outages

@SamuelTrahanNOAA
Copy link
Collaborator Author

For now, Acorn is a full pass. Not happy with these long term WCOSS2 outages

Acorn was created to lessen the impact of these outages, and also security concerns. WCOSS2's predecessors didn't have a mid-sized test system. You are witnessing the Acorn test machine serving its intended purpose.

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Jan 17, 2024

OK! I think we can move on for merging process. @SamuelTrahanNOAA I am moving to NOAA-GFDL/GFDL_atmos_cubed_sphere#308 to ask merging there.

@jkbk2004
Copy link
Collaborator

@SamuelTrahanNOAA FV3 pr was merged.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I've reverted the .gitmodules changes and updated the FV3 hash to NOAA-EMC develop's head.

This PR is ready for final review and merge.

@jkbk2004 jkbk2004 merged commit e7380dd into ufs-community:develop Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
No Baseline Change No Baseline Change Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Typo when setting DO_MYJPBL to true in certain RTs
5 participants